update gpu block size based on xattn #2764
base: master
Conversation
force-pushed from 5a146f4 to e8311d7
force-pushed from e8311d7 to eded411
force-pushed from b56c1c5 to ac7a454
I'll leave C++ review to @vshampor
Waiting for Vasily
See my previous comment
Pull Request Overview
This PR updates the GPU block size configuration to support XAttention, which uses a larger block size (256) compared to the standard GPU block size (16). The changes enable detection of XAttention at runtime and configure the appropriate block size accordingly.
- Adds XAttention detection logic based on cache dimensions
- Introduces sparse attention configuration support in benchmarking tools
- Refactors sparse attention setup into a reusable function
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/cpp/src/continuous_batching/cache_manager.hpp | Adds XAttention detection and sets GPU block size to 256 when XAttention is enabled |
| tools/who_what_benchmark/whowhatbench/model_loaders.py | Extracts sparse attention configuration into a separate function and adds validation logic |
| tools/llm_bench/llm_bench_utils/ov_utils.py | Adds validation to prevent conflicting sparse attention configuration |
| tools/llm_bench/task/text_generation.py | Moves GenerationConfig import outside conditional block for broader scope |
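For readers skimming the review, the cache_manager.hpp change boils down to choosing the KV-cache block size from the model's cache dimensions. Below is a minimal sketch of that selection logic, written in Python for brevity rather than the actual C++; the helper names and the exact way the shape encodes XAttention are assumptions based only on the PR description (detection from cache dimensions, block size 256 instead of the standard 16).

```python
# Hypothetical sketch, not the real cache_manager.hpp code: infer whether the model
# was built with XAttention from its KV-cache shape and pick the GPU block size.
DEFAULT_GPU_BLOCK_SIZE = 16       # standard GPU KV-cache block size
XATTENTION_GPU_BLOCK_SIZE = 256   # larger block size required by the XAttention kernels

def is_xattention_cache(kv_cache_shape: tuple[int, ...]) -> bool:
    # Assumption: an XAttention-enabled model exposes a KV-cache whose block
    # dimension is already 256, so the feature can be detected from the shape alone.
    return XATTENTION_GPU_BLOCK_SIZE in kv_cache_shape

def select_gpu_block_size(kv_cache_shape: tuple[int, ...]) -> int:
    if is_xattention_cache(kv_cache_shape):
        return XATTENTION_GPU_BLOCK_SIZE
    return DEFAULT_GPU_BLOCK_SIZE

# Example: a per-layer cache shape such as (num_blocks, num_heads, 256, head_size)
# would select the XAttention block size.
```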
Could do it, but how about a separate PR after CPU xattn is merged? Hope this won't be a blocker to GPU xattn.
If I understand correctly, "SparseAttentionMode" is set at compile time, when users create an LLMPipeline. Switching between TRISHAPE and XATTENTION will trigger the plugin to re-compile the model. Is there any problem here?
That's a good question. I think you are referring to the GPU KV block size, which is 256. There are two KV block sizes: 256 for XAttention on Xe2/Xe3 and newer GPU architectures, and 16 for legacy GPUs or when XAttention is not used. Sergey once suggested changing the block size to 256 everywhere, and we have a phase-by-phase plan to do that. Considering execution priority and effort, that work will happen after the full XAttention functionality is productized. We've discussed this in the email thread.
@vshampor I think we have to go this way, or else the test case will always fail. Created ticket CVS-175120 to track this.
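On the compile-time question above (SparseAttentionMode being fixed when the LLMPipeline is created), the flow would look roughly like the sketch below. The attribute and type names (use_sparse_attention, SparseAttentionConfig, sparse_attention_config) are assumptions, not confirmed by this PR; the only point taken from the discussion is that the mode is part of the scheduler configuration passed at pipeline construction, so switching between TRISHAPE and XATTENTION means building a new pipeline and re-compiling the model.

```python
import openvino_genai as ov_genai

# Rough sketch; the sparse-attention field and type names below are assumptions.
sa_config = ov_genai.SparseAttentionConfig()
sa_config.mode = ov_genai.SparseAttentionMode.XATTENTION  # or TRISHAPE

scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.use_sparse_attention = True
scheduler_config.sparse_attention_config = sa_config

# The mode is baked in here: changing it later means constructing a new
# LLMPipeline, which triggers the plugin to re-compile the model.
pipe = ov_genai.LLMPipeline("/path/to/model", "GPU", scheduler_config=scheduler_config)
print(pipe.generate("Hello", max_new_tokens=16))
```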
Description of openvino#32064 ([GPU] XAttention as a preview feature):

### Details:
- *XAttention for FP16 KVCache as a preview feature*
- [x] to add unit tests
- [x] to disable XAttention for legacy platforms (XAttention kernels are implemented for Xe2/Xe3 with CM)
- [x] to streamline the process of xattention. Currently kvcache shape is used to determine it. Maybe there is a better approach.
- [x] to add warning message for unsupported cases: multiple subsequences, typo error of kvcache precision, etc.
- [ ] to remove the trivial converter nodes from xattention_threshold Parameter to PageAttention input.
- [x] to refactor xattention kernel impls by reusing RT parameters, instead of recomputing them.
- [x] to enable path of U8 KVCache (stretch goal)
- [x] WWB with long prompts

This PR should work along with openvinotoolkit/openvino.genai#2764.

### Tickets:
- *CVS-173857*

---------

Signed-off-by: Zhai, Xuejun <[email protected]>
Co-authored-by: river.li <[email protected]>
Co-authored-by: Luo Cheng <[email protected]>
Co-authored-by: Li, Tingqian <[email protected]>
Co-authored-by: rnwang04 <[email protected]>
Co-authored-by: Wang Wangwang <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
Co-authored-by: Zhai, Xuejun <[email protected]>
Co-authored-by: Luwei Zhou <[email protected]>
Description of openvino#32551 ([GPU] XAttention as a preview feature):

### Details:
- *XAttention for FP16 KVCache as a preview feature*
- [x] to add unit tests
- [x] to disable XAttention for legacy platforms (XAttention kernels are implemented for Xe2/Xe3 with CM)
- [x] to streamline the process of xattention. Currently kvcache shape is used to determine it. Maybe there is a better approach.
- [x] to add warning message for unsupported cases: multiple subsequences, typo error of kvcache precision, etc.
- [ ] to remove the trivial converter nodes from xattention_threshold Parameter to PageAttention input.
- [x] to refactor xattention kernel impls by reusing RT parameters, instead of recomputing them.
- [x] to enable path of U8 KVCache (stretch goal)
- [x] WWB with long prompts

This PR should work along with openvinotoolkit/openvino.genai#2764.

### Tickets:
- *CVS-173857*

Same as [PR32064](#32064) 14e57f9:

$ git fetch origin pull/32064/head:pr/32064
$ git fetch origin pull/32551/head:pr/32551
$ git diff pr/32551 pr/32064
(empty)

Opened to resolve the commit checks issue on [PR32064](#32064): one commit (5201cdf) is signed by an unknown author. https://docs.github.com/en/github/authenticating-to-github/managing-commit-signature-verification/about-commit-signature-verification

---------

Signed-off-by: Zhai, Xuejun <[email protected]>
Signed-off-by: Chen, Peter <[email protected]>
Co-authored-by: river.li <[email protected]>
Co-authored-by: ceciliapeng2011 <[email protected]>
Co-authored-by: Luo Cheng <[email protected]>
Co-authored-by: Li, Tingqian <[email protected]>
Co-authored-by: rnwang04 <[email protected]>
Co-authored-by: Zhai, Xuejun <[email protected]>
Co-authored-by: Luwei Zhou <[email protected]>
Co-authored-by: Wang Wangwang <[email protected]>
@vshampor @ceciliapeng2011 Let's continue the discussion in CVS-175590 (GPU KV block size) and CVS-175120 (test XATTN with CPU)
Agreed in chat to continue the merge of this PR and follow up on the discussion.
def get_scheduler_config_genai(cb_config):
def configure_sparse_attention(scheduler_params, scheduler_config):
#2895 is working toward the same purpose.
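The extraction this thread (and apparently #2895) is concerned with can be pictured as below. This is a hedged sketch, not the code in model_loaders.py or ov_utils.py: the parameter keys and SchedulerConfig attribute names are assumptions, and only the overall shape of the validation (rejecting conflicting sparse-attention settings) is taken from the PR description.

```python
import openvino_genai as ov_genai

def configure_sparse_attention(scheduler_params: dict, scheduler_config: "ov_genai.SchedulerConfig"):
    """Hypothetical shared helper: apply sparse-attention settings from a user-supplied
    dict onto a SchedulerConfig, failing fast on conflicting options."""
    mode_name = scheduler_params.get("sparse_attention_mode")
    if mode_name is None:
        return scheduler_config
    if scheduler_params.get("use_sparse_attention") is False:
        # Conflict: a sparse-attention mode was requested while sparse attention
        # is explicitly disabled.
        raise ValueError("sparse_attention_mode is set but use_sparse_attention is False")
    scheduler_config.use_sparse_attention = True                  # assumed field name
    sa_config = ov_genai.SparseAttentionConfig()                  # assumed type name
    sa_config.mode = getattr(ov_genai.SparseAttentionMode, mode_name)
    scheduler_config.sparse_attention_config = sa_config          # assumed field name
    return scheduler_config
```

Keeping this in one helper lets llm_bench and WWB share the same validation instead of duplicating it.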
- [GPU] XAttention as a preview feature openvino#32064
- [GPU] XAttention as a preview feature openvino#32551
Tickets: CVS-173845